Introduction: You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
Is an automatic or manual transmission better for MPG
Quantify the MPG difference between automatic and manual transmissions
According to the description, the data contains 32 observations on 11 (numeric) variables.
The data structure is:
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
From this plot, it appeared that a manual transmission was better than an automatic transmission for MPG. More basic summaries in Appendix A also suggested that the manual transmission was better.
However, in the above plot, transmission was not adjusted for other terms. From the multivariable comparison chart in Appendix A, MPG was obviously correlated with many terms.
So, we wanted fit the data with a model with only necessary terms.
From the multivariable comparison chart in the Appendix A, we saw that variables were correlated. So, MPG might not be affected only by transmission(am), but also by some other factors.
To avoid unnecessary terms and include only necessary terms in our model, we applied a backward selection approach to fit our model. This method slowly removed one factor at a time, starting with the term with the highest p-value.
We took out the term with the highest p-value if its p-value were higher than above a specified p-value threshold (5%). We updated the model and checked the next term with the highest P-value. This continued until all the remaining terms in the model were below a specified p-value threshold.
# Initial our model with all terms
fit <- lm(mpg ~ . , data = mtcars)
# Show initial P-values
summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337416 18.71788443 0.6573058 0.51812440
## cyl -0.11144048 1.04502336 -0.1066392 0.91608738
## disp 0.01333524 0.01785750 0.7467585 0.46348865
## hp -0.02148212 0.02176858 -0.9868407 0.33495531
## drat 0.78711097 1.63537307 0.4813036 0.63527790
## wt -3.71530393 1.89441430 -1.9611887 0.06325215
## qsec 0.82104075 0.73084480 1.1234133 0.27394127
## vs 0.31776281 2.10450861 0.1509915 0.88142347
## am 2.52022689 2.05665055 1.2254035 0.23398971
## gear 0.65541302 1.49325996 0.4389142 0.66520643
## carb -0.19941925 0.82875250 -0.2406258 0.81217871
# Start the backward selection approach
require(MASS)
step <- stepAIC(fit, direction="backward")
Steps can be found in the Appendix B.
step$anova # display results
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Final Model:
## mpg ~ wt + qsec + am
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 21 147.4944 70.89774
## 2 - cyl 1 0.07987121 22 147.5743 68.91507
## 3 - vs 1 0.26852280 23 147.8428 66.97324
## 4 - carb 1 0.68546077 24 148.5283 65.12126
## 5 - gear 1 1.56497053 25 150.0933 63.45667
## 6 - drat 1 3.34455117 26 153.4378 62.16190
## 7 - disp 1 6.62865369 27 160.0665 61.51530
## 8 - hp 1 9.21946935 28 169.2859 61.30730
# Update our model
fit_combined <- update(fit, mpg ~ wt + qsec + factor(am), data = mtcars)
# Show final P-values
summary(fit_combined)
##
## Call:
## lm(formula = mpg ~ wt + qsec + factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## factor(am)1 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
As a result, all P-values were smaller than 5%. So, according to our criterion, we could reject there were no difference between our terms and MPG, which suggested that these three interaction terms were necessary.
Since factor(am)1 was maller than 5%, it implied there was difference for MPG between automatic and manual transmission.
Besides, the adjusted R-squared was 0.8336, meaning that 83.36% of the variance of the MPG could be explained by this model.
Since we wanted to know if there was any difference between automatic and manual transmission to MPG, and the model already included them with all necessary terms.
We saw that in addition to the intercept(referred to MPG), and slopes for “wt” and “qsec”, there was a third variable ‘factor(am)1’. When lm() encountered a factor variable with two levels, it created a new variable based on the second level.
In our case, the term “am” was a binary variable that took the value 1 if the transmission was manual, and 0 if it was automatic, and therefore ‘factor(am)1’ is created. The fitted equation for two groups can be written as
Automatic transmission: mpg = 9.6178 -3.9165(wt) + 1.2259(qsec) Manual transmission: mpg = 9.6178 + 2.9358 * 1 -3.9165(wt) + 1.2259(qsec)
As a result, if the transmission is manual, we will consider the term “2.9358(manual)”, otherwise, we will ignore the term “2.9358(manual)”.
We can see that the slopes for “wt and”qsec" are the same. The intercept for automatic transmission is 9.6178. The intercept for a manual transmission is 9.6178 + (2.9358 * 1) = 12.5536.
(0 = automatic(light gray), 1 = manual(dark gray))If we visualized the data (Weight, mpg, qsec) and fitted a linear model for two groups, (red = automatic; blue = manual), the regression plane (light gray) for the manual transmission positioned higher that the the regression plane (dark gray) for the automatic transmission in terms of mpg.
Since the intercept represents the mean value for MPG, it implies that for the same input to the formula, a manual transmission is better for MPG on average. And it will be 2.9358 Miles/(US) gallon better.
From the previous plot, we could see that the regression plane (light gray) for the manual transmission did not align well with the data (blue color). We might be able to adjust the model with an interaction term. It means to split the data into automatic and manual transmission and fit a linear model for each group.
# Split into two groups and fit two models
mtcars0 <- mtcars[mtcars$am==0,]
mtcars1 <- mtcars[mtcars$a==1,]
fit_am0 <- lm(mpg ~ wt + qsec, data = mtcars0)
summary(fit_am0)
##
## Call:
## lm(formula = mpg ~ wt + qsec, data = mtcars0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.118 -1.363 -1.011 1.181 3.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.2489 6.7148 1.675 0.113316
## wt -2.9963 0.6636 -4.515 0.000352 ***
## qsec 0.9454 0.2945 3.210 0.005464 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.033 on 16 degrees of freedom
## Multiple R-squared: 0.7502, Adjusted R-squared: 0.7189
## F-statistic: 24.02 on 2 and 16 DF, p-value: 1.518e-05
fit_am1 <- lm(mpg ~ wt + qsec, data = mtcars1)
summary(fit_am1)
##
## Call:
## lm(formula = mpg ~ wt + qsec, data = mtcars1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6832 -0.9178 -0.0709 0.5567 4.0907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.1754 11.1991 1.802 0.101799
## wt -6.7544 1.4306 -4.721 0.000815 ***
## qsec 1.1810 0.4925 2.398 0.037434 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.244 on 10 degrees of freedom
## Multiple R-squared: 0.8896, Adjusted R-squared: 0.8675
## F-statistic: 40.29 on 2 and 10 DF, p-value: 1.639e-05
Because of the presence of the interaction term (transmission), both the slopes and intercepts are different for automatic and manual.
Also note that the coefficients of intercepts, having p-values highter than 0.05, are not statistically significant, meaning that the intercepts for automatic and manual are probably not very different. However, the slopes for automatic and manual are different.
The regression equation above can be split into separate equations for two groups: Automatic transmission: mpg = 11.2489 -2.9963(wt) + 0.9454(qsec) Manual transmission: mpg = 20.1754 -6.7544(wt) + 1.1810(qsec)
The regression planes are not parallel, but the interaction term is very small and statistically insignificant, so if we could visualize them they would be almost parallel.
The model predicts that mpg increase with lower weight, but for the automatic transmission the slope of this increase is smaller. The slopes intersect at between around 5000 and 2653 lbs. For transmission with less than 2653 lbs weight, the mpg of manual is higher compared to the mpg of automatic transmission with the same details.
1/4 time mile has an additional positive effect on mpg. The slopes intersect at between around 14.5 and 22.9 lbs. Thus the maunal transmission with “1/4 mile time” greater than 14.5, will have higher mpg compared to automatic ones of the same details.
Nevertheless, the difference in slopes is due to the interaction term which is not statistically significant. So there is no evidence in the data that slopes should be different. No adjustment is needed.
We diagnosed our model with factor variables by the variance inflation factor(VIF). The VIF estimates how much the variance of a regression coefficient is inflated due to multicollinearity in the model. If VIF values were greater than 10, it indicated that terms were highly collinear with the other terms in the model.
library(car)
vif(fit_combined)
## wt qsec factor(am)
## 2.482952 1.364339 2.541437
The result of VIF was satisfactory. They were all under 10. Besides, we also used diagnostic plots to provide checks for heteroscedasticity, normality, and influential observerations. You may find them in the Appendix D. The plots showed that there was no a very clear systematic pattern in our residuals.
As a result, we were confident with our first model with factor variables.
With all the previous analysis, we can conclude that our regression model is a fit.
Automatic transmission: mpg = 9.6178 -3.9165(wt) + 1.2259(qsec) Manual transmission: mpg = 9.6178 + 2.9358 * 1 -3.9165(wt) + 1.2259(qsec)
The weights and 1/4 mile time are related to mpg, but the adjusted estimate (mpg) depends on group status. It estimates that the manual transmission has a higher mpg than the automatic transmission.
# Summary for the automatic transmission
summary(mtcars[mtcars$am==0,])
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. :120.1 Min. : 62.0
## 1st Qu.:14.95 1st Qu.:6.000 1st Qu.:196.3 1st Qu.:116.5
## Median :17.30 Median :8.000 Median :275.8 Median :175.0
## Mean :17.15 Mean :6.947 Mean :290.4 Mean :160.3
## 3rd Qu.:19.20 3rd Qu.:8.000 3rd Qu.:360.0 3rd Qu.:192.5
## Max. :24.40 Max. :8.000 Max. :472.0 Max. :245.0
## drat wt qsec vs am
## Min. :2.760 Min. :2.465 Min. :15.41 Min. :0.0000 Min. :0
## 1st Qu.:3.070 1st Qu.:3.438 1st Qu.:17.18 1st Qu.:0.0000 1st Qu.:0
## Median :3.150 Median :3.520 Median :17.82 Median :0.0000 Median :0
## Mean :3.286 Mean :3.769 Mean :18.18 Mean :0.3684 Mean :0
## 3rd Qu.:3.695 3rd Qu.:3.842 3rd Qu.:19.17 3rd Qu.:1.0000 3rd Qu.:0
## Max. :3.920 Max. :5.424 Max. :22.90 Max. :1.0000 Max. :0
## gear carb
## Min. :3.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :3.000
## Mean :3.211 Mean :2.737
## 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :4.000 Max. :4.000
# Summary for the manual transmission
summary(mtcars[mtcars$am==1,])
## mpg cyl disp hp drat
## Min. :15.00 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :3.54
## 1st Qu.:21.00 1st Qu.:4.000 1st Qu.: 79.0 1st Qu.: 66.0 1st Qu.:3.85
## Median :22.80 Median :4.000 Median :120.3 Median :109.0 Median :4.08
## Mean :24.39 Mean :5.077 Mean :143.5 Mean :126.8 Mean :4.05
## 3rd Qu.:30.40 3rd Qu.:6.000 3rd Qu.:160.0 3rd Qu.:113.0 3rd Qu.:4.22
## Max. :33.90 Max. :8.000 Max. :351.0 Max. :335.0 Max. :4.93
## wt qsec vs am gear
## Min. :1.513 Min. :14.50 Min. :0.0000 Min. :1 Min. :4.000
## 1st Qu.:1.935 1st Qu.:16.46 1st Qu.:0.0000 1st Qu.:1 1st Qu.:4.000
## Median :2.320 Median :17.02 Median :1.0000 Median :1 Median :4.000
## Mean :2.411 Mean :17.36 Mean :0.5385 Mean :1 Mean :4.385
## 3rd Qu.:2.780 3rd Qu.:18.61 3rd Qu.:1.0000 3rd Qu.:1 3rd Qu.:5.000
## Max. :3.570 Max. :19.90 Max. :1.0000 Max. :1 Max. :5.000
## carb
## Min. :1.000
## 1st Qu.:1.000
## Median :2.000
## Mean :2.923
## 3rd Qu.:4.000
## Max. :8.000
require(GGally)
# Multivariable Comparison
g <- ggpairs(mtcars)
g
fit <- lm(mpg ~ . , data = mtcars)
# Start the backward selection approach
require(MASS)
step <- stepAIC(fit, direction="backward")
## Start: AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - cyl 1 0.0799 147.57 68.915
## - vs 1 0.1601 147.66 68.932
## - carb 1 0.4067 147.90 68.986
## - gear 1 1.3531 148.85 69.190
## - drat 1 1.6270 149.12 69.249
## - disp 1 3.9167 151.41 69.736
## - hp 1 6.8399 154.33 70.348
## - qsec 1 8.8641 156.36 70.765
## <none> 147.49 70.898
## - am 1 10.5467 158.04 71.108
## - wt 1 27.0144 174.51 74.280
##
## Step: AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - vs 1 0.2685 147.84 66.973
## - carb 1 0.5201 148.09 67.028
## - gear 1 1.8211 149.40 67.308
## - drat 1 1.9826 149.56 67.342
## - disp 1 3.9009 151.47 67.750
## - hp 1 7.3632 154.94 68.473
## <none> 147.57 68.915
## - qsec 1 10.0933 157.67 69.032
## - am 1 11.8359 159.41 69.384
## - wt 1 27.0280 174.60 72.297
##
## Step: AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 1 0.6855 148.53 65.121
## - gear 1 2.1437 149.99 65.434
## - drat 1 2.2139 150.06 65.449
## - disp 1 3.6467 151.49 65.753
## - hp 1 7.1060 154.95 66.475
## <none> 147.84 66.973
## - am 1 11.5694 159.41 67.384
## - qsec 1 15.6830 163.53 68.200
## - wt 1 27.3799 175.22 70.410
##
## Step: AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 1 1.565 150.09 63.457
## - drat 1 1.932 150.46 63.535
## <none> 148.53 65.121
## - disp 1 10.110 158.64 65.229
## - am 1 12.323 160.85 65.672
## - hp 1 14.826 163.35 66.166
## - qsec 1 26.408 174.94 68.358
## - wt 1 69.127 217.66 75.350
##
## Step: AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - drat 1 3.345 153.44 62.162
## - disp 1 8.545 158.64 63.229
## <none> 150.09 63.457
## - hp 1 13.285 163.38 64.171
## - am 1 20.036 170.13 65.466
## - qsec 1 25.574 175.67 66.491
## - wt 1 67.572 217.66 73.351
##
## Step: AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - disp 1 6.629 160.07 61.515
## <none> 153.44 62.162
## - hp 1 12.572 166.01 62.682
## - qsec 1 26.470 179.91 65.255
## - am 1 32.198 185.63 66.258
## - wt 1 69.043 222.48 72.051
##
## Step: AIC=61.52
## mpg ~ hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - hp 1 9.219 169.29 61.307
## <none> 160.07 61.515
## - qsec 1 20.225 180.29 63.323
## - am 1 25.993 186.06 64.331
## - wt 1 78.494 238.56 72.284
##
## Step: AIC=61.31
## mpg ~ wt + qsec + am
##
## Df Sum of Sq RSS AIC
## <none> 169.29 61.307
## - am 1 26.178 195.46 63.908
## - qsec 1 109.034 278.32 75.217
## - wt 1 183.347 352.63 82.790